11/21/2020

Original Premise

  • We originally wanted to build a scouting report for every batter in the league, using custom statistics derived from advanced batting data. We would then produce a heatmap and a detailed report of which pitches work on each batter, and where in the zone they work. Gleefully, we set about finding our data.
  • Our data exists, but isn’t cheap. Back to the drawing board.
  • Instead, we decided to see whether we could predict the net number of wins a pitcher is worth based on his ERA (earned run average), OBP (opponent on-base percentage), and whatever else we could think of. We found an R package called Lahman that would do the job.

Beginning Research

  • Here, we’re going to look at how we selected and in some cases created the column data we needed.
#Here, we look at the relevant data from Lahman
library(Lahman)
library(dplyr)

#Select columns from the Pitching dataset
pitchers <- select(tibble(Pitching), playerID, yearID, teamID, IPouts, BB, SO, BAOpp, ERA, W, L)


#Create a Net Wins column
pitchers <- pitchers %>% mutate(NetWins = W-L)

#Only keep rows where there is no missing data
pitchers <- pitchers[complete.cases(pitchers),]

#Normalize data so that coefficients are meaningful
pitchers <- pitchers %>% mutate(normIPouts = (IPouts - mean(IPouts)) / sd(IPouts))
pitchers <- pitchers %>% mutate(normBB = (BB - mean(BB)) / sd(BB))
pitchers <- pitchers %>% mutate(normSO = (SO - mean(SO)) / sd(SO))
pitchers <- pitchers %>% mutate(normBAOpp = (BAOpp - mean(BAOpp)) / sd(BAOpp))
pitchers <- pitchers %>% mutate(normERA = (ERA - mean(ERA)) / sd(ERA))
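The five mutate() calls above can be collapsed into one using dplyr’s across() (dplyr 1.0+); a sketch that produces the same norm* columns:

```r
#Equivalent normalization in a single step with across()
pitchers <- pitchers %>%
  mutate(across(c(IPouts, BB, SO, BAOpp, ERA),
                ~ (.x - mean(.x)) / sd(.x),
                .names = "norm{.col}"))
```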

Building a Linear Model

-Finally, we can build our model, as shown below.

#Build Linear Model
mylm <- lm(NetWins ~ normIPouts + normBB + normSO + normBAOpp + normERA, data = pitchers)

#Analyze Findings
summary(mylm)
## 
## Call:
## lm(formula = NetWins ~ normIPouts + normBB + normSO + normBAOpp + 
##     normERA, data = pitchers)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.0234  -1.4264   0.3797   1.3818  22.3251 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.0002319  0.0158333   0.015   0.9883    
## normIPouts   1.3883292  0.0431626  32.165  < 2e-16 ***
## normBB      -1.8568112  0.0353850 -52.475  < 2e-16 ***
## normSO       1.2529491  0.0316495  39.588  < 2e-16 ***
## normBAOpp   -0.0398335  0.0159113  -2.503   0.0123 *  
## normERA     -0.1265619  0.0163386  -7.746 9.68e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.288 on 43110 degrees of freedom
## Multiple R-squared:  0.1389, Adjusted R-squared:  0.1388 
## F-statistic:  1391 on 5 and 43110 DF,  p-value: < 2.2e-16
summary(mylm)$r.squared 
## [1] 0.1389159

Building a Generalized Additive Model

-That last model sucked: it explained only about 14% of the variance. Let’s try again with a better model.
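The chunk that produced the output below isn’t shown in this export; a minimal sketch with mgcv, assuming the same linear formula as the lm (with no s() smooth terms a GAM reduces to the same least-squares fit, which is why the coefficients below match the lm exactly):

```r
library(mgcv)

#Fit a GAM with purely parametric (linear) terms
mygam <- gam(NetWins ~ normIPouts + normBB + normSO + normBAOpp + normERA,
             family = gaussian, data = pitchers)
summary(mygam)
```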

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## NetWins ~ normIPouts + normBB + normSO + normBAOpp + normERA
## 
## Parametric coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.0002319  0.0158333   0.015   0.9883    
## normIPouts   1.3883292  0.0431626  32.165  < 2e-16 ***
## normBB      -1.8568112  0.0353850 -52.475  < 2e-16 ***
## normSO       1.2529491  0.0316495  39.588  < 2e-16 ***
## normBAOpp   -0.0398335  0.0159113  -2.503   0.0123 *  
## normERA     -0.1265619  0.0163386  -7.746 9.68e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## R-sq.(adj) =  0.139   Deviance explained = 13.9%
## GCV =  10.81  Scale est. = 10.809    n = 43116
## NULL

Building a Generalized Linear Model

-That last model sucked too. Let’s try another one–a generalized linear model.
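The chunk itself isn’t shown, but the Call line in the output below pins down the fit; reconstructed:

```r
#Gaussian GLM with the default identity link -- equivalent to the earlier lm
myglm <- glm(NetWins ~ normIPouts + normBB + normSO + normBAOpp + normERA,
             family = gaussian, data = pitchers)
summary(myglm)
```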

## 
## Call:
## glm(formula = NetWins ~ normIPouts + normBB + normSO + normBAOpp + 
##     normERA, family = gaussian, data = pitchers)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -22.0234   -1.4264    0.3797    1.3818   22.3251  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.0002319  0.0158333   0.015   0.9883    
## normIPouts   1.3883292  0.0431626  32.165  < 2e-16 ***
## normBB      -1.8568112  0.0353850 -52.475  < 2e-16 ***
## normSO       1.2529491  0.0316495  39.588  < 2e-16 ***
## normBAOpp   -0.0398335  0.0159113  -2.503   0.0123 *  
## normERA     -0.1265619  0.0163386  -7.746 9.68e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 10.80891)
## 
##     Null deviance: 541146  on 43115  degrees of freedom
## Residual deviance: 465972  on 43110  degrees of freedom
## AIC: 224998
## 
## Number of Fisher Scoring iterations: 2
## NULL

Building models

-Wow, that went well. Let’s see if we can find any kind of model that works at all, even ones that are nonlinear and completely unintuitive to normal humans.

-We’re going to throw everything at it. We are inevitable.

-Put on that infinity glove and snap your fingers
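The “everything” model isn’t shown in this export. One plausible sketch, assuming a penalized smooth of each predictor via mgcv (the actual model that explained roughly half the variation may have been something else entirely):

```r
library(mgcv)

#Hypothetical kitchen-sink fit: a smooth term for every predictor,
#letting the data pick the degree of nonlinearity
bigmod <- gam(NetWins ~ s(normIPouts) + s(normBB) + s(normSO) +
                s(normBAOpp) + s(normERA),
              data = pitchers)
summary(bigmod)$r.sq
```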

After the MCU

  • Well, look at that: we snapped our fingers and explained just over half of the variation. Excellent work, team!

  • Just kidding, we need to be able to do a lot better than this.

  • Clearly, this isn’t going as well as it might have. Let’s make some graphs and see what we can find before we build another model.

Crossplot

Interesting Things

To me, the most interesting relationship was strikeouts by OBP, so I made an interactive graph.

Interactive OBP x Strikeouts
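The interactive figure doesn’t survive this static export. A sketch of how such a scatter is commonly built, wrapping a ggplot in ggplotly(); this is a guess at the approach (using BAOpp as the on-base-side variable available in our data), not the original chunk:

```r
library(ggplot2)
library(plotly)

#Hypothetical reconstruction: strikeouts against opponent batting average,
#one point per pitcher-season, with hover tooltips supplied by ggplotly()
p <- ggplot(pitchers, aes(x = SO, y = BAOpp, label = playerID)) +
  geom_point(alpha = 0.3)
ggplotly(p)
```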

Another interactive graph


Advanced pitching metrics

  • We’re going to look at projecting OBP and Wins using some advanced metrics.

Fastball speed

Hard Hit Percent

Exit Velocity

Launch Angle

Barrel Batted Rate

Spin Speed on Breaking Ball

## Warning: Removed 1 rows containing missing values (geom_point).

Sweet Spot Percent

New Model

  • We did find some statistics with positive trends. Are they significant enough to be useful? Let’s find out!

mylm <- lm(on_base_percent ~ hard_hit_percent * sweet_spot_percent * exit_velocity_avg * year, data = advancedPitchers)

## 
## Call:
## lm(formula = on_base_percent ~ hard_hit_percent * sweet_spot_percent * 
##     exit_velocity_avg * year, data = advancedPitchers)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.072682 -0.016905  0.001168  0.018066  0.063968 
## 
## Coefficients:
##                                                              Estimate
## (Intercept)                                                 2.975e+03
## hard_hit_percent                                           -8.083e+01
## sweet_spot_percent                                         -8.295e+01
## exit_velocity_avg                                          -3.629e+01
## year                                                       -1.482e+00
## hard_hit_percent:sweet_spot_percent                         2.528e+00
## hard_hit_percent:exit_velocity_avg                          1.007e+00
## sweet_spot_percent:exit_velocity_avg                        1.008e+00
## hard_hit_percent:year                                       4.033e-02
## sweet_spot_percent:year                                     4.131e-02
## exit_velocity_avg:year                                      1.807e-02
## hard_hit_percent:sweet_spot_percent:exit_velocity_avg      -3.102e-02
## hard_hit_percent:sweet_spot_percent:year                   -1.260e-03
## hard_hit_percent:exit_velocity_avg:year                    -5.020e-04
## sweet_spot_percent:exit_velocity_avg:year                  -5.018e-04
## hard_hit_percent:sweet_spot_percent:exit_velocity_avg:year  1.546e-05
##                                                            Std. Error t value
## (Intercept)                                                 1.219e+04   0.244
## hard_hit_percent                                            3.466e+02  -0.233
## sweet_spot_percent                                          3.810e+02  -0.218
## exit_velocity_avg                                           1.389e+02  -0.261
## year                                                        6.045e+00  -0.245
## hard_hit_percent:sweet_spot_percent                         1.084e+01   0.233
## hard_hit_percent:exit_velocity_avg                          3.934e+00   0.256
## sweet_spot_percent:exit_velocity_avg                        4.341e+00   0.232
## hard_hit_percent:year                                       1.719e-01   0.235
## sweet_spot_percent:year                                     1.889e-01   0.219
## exit_velocity_avg:year                                      6.890e-02   0.262
## hard_hit_percent:sweet_spot_percent:exit_velocity_avg       1.230e-01  -0.252
## hard_hit_percent:sweet_spot_percent:year                    5.373e-03  -0.235
## hard_hit_percent:exit_velocity_avg:year                     1.951e-03  -0.257
## sweet_spot_percent:exit_velocity_avg:year                   2.153e-03  -0.233
## hard_hit_percent:sweet_spot_percent:exit_velocity_avg:year  6.097e-05   0.254
##                                                            Pr(>|t|)
## (Intercept)                                                   0.807
## hard_hit_percent                                              0.816
## sweet_spot_percent                                            0.828
## exit_velocity_avg                                             0.794
## year                                                          0.807
## hard_hit_percent:sweet_spot_percent                           0.816
## hard_hit_percent:exit_velocity_avg                            0.798
## sweet_spot_percent:exit_velocity_avg                          0.817
## hard_hit_percent:year                                         0.815
## sweet_spot_percent:year                                       0.827
## exit_velocity_avg:year                                        0.793
## hard_hit_percent:sweet_spot_percent:exit_velocity_avg         0.801
## hard_hit_percent:sweet_spot_percent:year                      0.815
## hard_hit_percent:exit_velocity_avg:year                       0.797
## sweet_spot_percent:exit_velocity_avg:year                     0.816
## hard_hit_percent:sweet_spot_percent:exit_velocity_avg:year    0.800
## 
## Residual standard error: 0.02591 on 352 degrees of freedom
## Multiple R-squared:  0.1873, Adjusted R-squared:  0.1527 
## F-statistic:  5.41 on 15 and 352 DF,  p-value: 7.457e-10
## [1] 0.1873352

So that sucked

  • Let’s try another model.

  • Throw everything on the pitching side.

  • on_base_percent ~ fastball_avg_speed * breaking_avg_speed * offspeed_avg_speed * fastball_avg_spin * breaking_avg_spin * offspeed_avg_spin

  • R-squared?

## [1] 0.3821325
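The fit behind this number isn’t shown; a sketch, assuming advancedPitchers carries the named speed and spin columns:

```r
#Hypothetical reconstruction: full interaction model over the pitch-arsenal
#metrics, then pull out R-squared
arsenal_lm <- lm(on_base_percent ~ fastball_avg_speed * breaking_avg_speed *
                   offspeed_avg_speed * fastball_avg_spin *
                   breaking_avg_spin * offspeed_avg_spin,
                 data = advancedPitchers)
summary(arsenal_lm)$r.squared
```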

So that still sucked

  • Let’s try again. Throw everything on the hitting side.
## [1] 0.3544678

Baseball is hard

Why are we having so much trouble?

Concluding graph

What did we find out?

Projecting pitching success in baseball is very unpredictable. To further support this argument, consider Jacob deGrom’s Cy Young-winning year in 2018 compared to Bob Welch’s Cy Young-winning year in 1990.

Where can we go from here?

After deliberating about why we couldn’t find anything useful in our research, we concluded that there are too many outside variables that can factor into player success. Team performance is the most essential variable we could think of. deGrom’s Mets finished with 77 wins and were near the bottom of the MLB in many major offensive categories. Welch’s A’s were the best team in baseball, finished with 103 wins, and were at the top of many offensive categories. A potential new research topic would be a new statistic, an adjusted win statistic, that measures a pitcher’s worth as if he were backed by league-average offense and defense. This is just one thing that could be explored in the future!

Thank You!